Apple NPU acceleration integrated into llama.cpp, using MiniCPM-V 4.0 as an example. #15262

Open · wants to merge 19 commits into master

Conversation

@tc-mb (Contributor) commented Aug 12, 2025

As stated in #14983, I have integrated Apple NPU (ANE) acceleration into llama.cpp.

Using MiniCPM-V 4.0 as an example, I will introduce a simple way to use the ANE, and I hope we can discuss a better approach.

1. Build llama.cpp locally. I added an `ENABLE_ANE` option to control whether the ANE is used:

   ```bash
   cmake -B build -DENABLE_ANE=ON
   cmake --build build --config Release -j 8
   ```

2. Download the ANE model from Hugging Face or ModelScope. If you downloaded the zip file, please unzip it.

3. Use it like mmproj: I added an `--ane` option, whose argument is the path to the downloaded `ane_minicpmv4_vit_f16.mlmodelc` file:

   ```bash
   ./build/bin/llama-mtmd-cli -m {dir_path}/ggml-model-Q4_0.gguf --mmproj {dir_path}/mmproj-model-f16.gguf --ane {dir_path}/ane_minicpmv4_vit_f16.mlmodelc -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image {dir_path}/xx.png -p "Describe the content of the image in detail."
   ```

I tested ANE acceleration on several devices. The benchmark results are as follows:

Mac M2, q4_K_M — prefill time (ms):

| # | image size | MiniCPM-V 4.0 (ANE) | MiniCPM-V 4.0 |
|---|------------|--------------------:|--------------:|
| 1 | 448×448    | 790.26              | 5716.77       |
| 2 | 600×600    | 1894.24             | 17961.35      |
| 3 | 700×700    | 2954.34             | 27866.59      |
| 4 | 800×800    | 2964.44             | 27946.48      |
| 5 | 1024×625   | 2977.56             | 30111.43      |
| 6 | 1024×768   | 2975.98             | 30415.11      |
| 7 | 1280×960   | 4065.79             | 41889.12      |

Mac M4, q4_K_M — prefill time (ms):

| # | image size | MiniCPM-V 4.0 (ANE) | MiniCPM-V 4.0 |
|---|------------|--------------------:|--------------:|
| 1 | 448×448    | 412.57              | 736.57        |
| 2 | 600×600    | 989.44              | 3365.09       |
| 3 | 700×700    | 1564.61             | 4031.90       |
| 4 | 800×800    | 1555.85             | 4124.81       |
| 5 | 1024×625   | 1563.65             | 5405.13       |
| 6 | 1024×768   | 1567.45             | 5169.05       |
| 7 | 1280×960   | 2141.54             | 7544.96       |

A point worth noting: the first time the ANE model is used there is a one-time loading cost, so the first run is slightly slower. After that, as long as the model is not updated, it stays loaded and ready in the system.

@github-actions bot added the examples and python (python script changes) labels on Aug 12, 2025
@ggerganov (Member) left a comment:

Generally looks OK. Need to improve encapsulation of the CoreML code (see comments). Would need a review from @ngxson.

Also:

  • Use "CoreML" instead of "ANE"
  • Would eventually need instructions for generating the CoreML inference code - can add those after the PR is approved

Comment on lines 98 to 100
```cpp
bool ane_embedding(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);
bool ane_resampler(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, const float * vit_embedding, float * vec);
```

Member: No need to expose this in the public interface.
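A sketch of the suggested change, assuming the two declarations above are simply removed from the public header and the definitions in clip.cpp are marked static (names taken from the PR's snippet; the final layout may differ):

```cpp
// clip.cpp — internal helpers, no longer declared in clip.h
static bool ane_embedding(struct clip_ctx * ctx, int n_threads,
                          const struct clip_image_f32_batch * imgs, float * vec);
static bool ane_resampler(struct clip_ctx * ctx, int n_threads,
                          const struct clip_image_f32_batch * imgs,
                          const float * vit_embedding, float * vec);
```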

Comment on lines +115 to +117

```cpp
// ANE support functions
void clip_set_ane_model_path(struct clip_ctx * ctx, const char * ane_model_path);
```
Member: We should find a way to avoid this. Maybe we can do something similar to whisper.cpp:

https://github.com/ggml-org/whisper.cpp/blob/f7502dca872866a310fe69d30b163fa87d256319/src/whisper.cpp#L3351-L3373
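For reference, the whisper.cpp code linked above derives the CoreML model path from the main model path instead of taking it through the public API. A hedged sketch of the same pattern for clip — the helper name and the `-encoder.mlmodelc` suffix convention are assumptions, not part of this PR:

```cpp
#include <string>

// Hypothetical helper: derive the CoreML model path from the mmproj path,
// e.g. "mmproj-model-f16.gguf" -> "mmproj-model-f16-encoder.mlmodelc",
// so no clip_set_ane_model_path() setter is needed in the public interface.
static std::string clip_get_coreml_path_encoder(std::string path_mmproj) {
    const auto pos = path_mmproj.rfind(".gguf");
    if (pos != std::string::npos) {
        path_mmproj.replace(pos, std::string::npos, "-encoder.mlmodelc");
    }
    return path_mmproj;
}
```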

```diff
@@ -82,6 +82,7 @@ struct mtmd_context_params {
     enum ggml_log_level verbosity;
     const char * image_marker; // deprecated, use media_marker instead
     const char * media_marker;
+    const char * ane_model_path; // path to ANE model for iOS
```
Member: Instead of the term "ane", use the term "coreml", as it is more correct. CoreML models can run not only on the Apple Neural Engine, but also on the GPU and CPU.

Comment on lines +3845 to +3852

```cpp
static int flag = 0;
static const void* coremlEncoder = NULL;
static std::string cached_model_path = "";

// Check if we need to load a new model
if (flag == 0 || (ane_model_path && cached_model_path != ane_model_path)) {
    if (coremlEncoder) {
```
Member: Avoid this global state. Figure out a way to move this to the clip context.
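One possible shape for that, as a sketch: keep the cached encoder handle and the path it was loaded from as members of clip_ctx, so each context owns its own CoreML state. The member names are assumptions, and loadModel/closeModel are hypothetical stand-ins for whatever the Objective-C wrapper in this PR exposes:

```cpp
struct clip_ctx {
    // ... existing members ...
    const void * coreml_encoder = nullptr; // lazily-loaded CoreML model handle
    std::string  coreml_model_path;        // path the handle was loaded from
};

static const void * clip_get_coreml_encoder(clip_ctx * ctx, const char * path) {
    // (Re)load only when the requested path changes; no global/static state.
    if (!ctx->coreml_encoder || ctx->coreml_model_path != path) {
        if (ctx->coreml_encoder) {
            closeModel(ctx->coreml_encoder); // hypothetical unload wrapper
        }
        ctx->coreml_encoder    = loadModel(path); // hypothetical load wrapper
        ctx->coreml_model_path = path;
    }
    return ctx->coreml_encoder;
}
```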

@ngxson (Collaborator) left a comment:

The overall idea is good. However, I think we should take the time to make sure this can be useful in the long term.

The biggest issue at the moment is that many TODOs are being copied in this PR, which will make refactoring very difficult in the future. We must resolve this problem first.

Regarding UX: if we cannot have the embeddings and resampler all in one CoreML model, I think we should split the release into two repos on Hugging Face or ModelScope, one containing only the ggml implementation and one containing the CoreML model. Having everything in the same place seems very confusing for most users, most of whom won't have time to read this PR.

Comment on lines +3894 to +3896

```cpp
static bool ane_embedding(clip_ctx * ctx, const int n_threads, const clip_image_f32_batch * imgs_c_ptr, float * vec) {
    const clip_image_f32_batch & imgs = *imgs_c_ptr;
    int batch_size = imgs.entries.size();
```
Collaborator: I don't feel quite comfortable duplicating this function, as you're also duplicating many TODOs, which will make cleaning this up extremely difficult in the future.

We should find a way to merge it with an existing function.
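One possible shape for that merge, as a sketch: branch inside the existing encode entry point rather than cloning it. clip_image_batch_encode is the existing clip API; the CLIP_USE_ANE guard and the clip_image_encode_with_ane helper are assumptions for illustration only:

```cpp
// Sketch: one shared entry point instead of a duplicated ANE variant.
bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads,
                             const clip_image_f32_batch * imgs, float * vec) {
#ifdef CLIP_USE_ANE // hypothetical define set by the ENABLE_ANE build option
    if (!ctx->ane_model_path.empty()) {
        // dispatch only the ViT portion to CoreML; pre/post-processing is shared
        return clip_image_encode_with_ane(ctx, n_threads, imgs, vec); // hypothetical
    }
#endif
    // ... existing ggml-only path, unchanged ...
}
```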

Author: I understand your idea; I will try to modify it.

Comment on lines +3878 to +3879

```cpp
float * vit_embedding1 = (float *)malloc(1100*1152*sizeof(float));
float * vit_embedding2 = (float *)malloc(1100*1152*sizeof(float));
```
Collaborator: We should avoid malloc because we had a lot of memory leaks in the old code base of clip.cpp. Use std::vector<float> instead.
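A minimal sketch of the suggested change, using the sizes from the snippet above:

```cpp
#include <vector>

// RAII buffers instead of raw malloc; freed automatically on scope exit.
std::vector<float> vit_embedding1(1100 * 1152);
std::vector<float> vit_embedding2(1100 * 1152);

// Pass .data() wherever a float * is expected, e.g.:
// ane_embedding(ctx, n_threads, &imgs, vit_embedding1.data());
```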

Comment on lines +3881 to +3883
ane_embedding(ctx, n_threads, &imgs, vit_embedding1);
clip_image_encode_ane(vit_embedding1, vit_embedding2, ctx->ane_model_path.c_str());
ane_resampler(ctx, n_threads, &imgs, vit_embedding2, vec);
Collaborator: It seems like only the ViT part is done by the ANE; the rest (embeddings, resampler) is still done by ggml. Any reason why we can't do the rest with the ANE? I think it could be a cleaner approach, as we would then be able to load only the .mlmodelc file and no longer need the mmproj.gguf file.

Collaborator: Also, maybe we should try ggml_custom_4d and inject clip_image_encode_ane as a node on the ggml cgraph. If that works, it will make everything look much cleaner. Do you think it's a valid use case for ggml_custom_4d, @ggerganov?
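A rough sketch of what that injection might look like, assuming ggml's custom-op API (ggml_custom_4d taking a ggml_custom_op_t callback) and the clip_image_encode_ane wrapper from this PR; the tensor shapes and the single-threaded callback are illustrative only, not the actual implementation:

```cpp
// Callback that runs the CoreML ViT as a node in the ggml compute graph.
static void coreml_vit_op(struct ggml_tensor * dst, int ith, int nth, void * userdata) {
    if (ith != 0) {
        return; // the CoreML call is not split across ggml threads
    }
    auto * cctx = (clip_ctx *) userdata;
    const struct ggml_tensor * src = dst->src[0]; // patch embeddings from ggml
    clip_image_encode_ane((float *) src->data, (float *) dst->data,
                          cctx->ane_model_path.c_str());
}

// When building the cgraph, replace the ggml ViT block with one custom node:
struct ggml_tensor * args[] = { embeddings };
struct ggml_tensor * vit_out = ggml_custom_4d(
        ctx0, GGML_TYPE_F32,
        embeddings->ne[0], embeddings->ne[1], embeddings->ne[2], embeddings->ne[3],
        args, 1, coreml_vit_op, 1 /*n_tasks*/, cctx);
```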

Author: @ngxson Yes, only the ViT is currently replaced with the ANE. Because the embedding calculations aren't yet computed correctly on the ANE, I've bypassed the two embedding steps and only replaced the ViT itself. I'm also still trying other methods to see if there's a solution.

Member:

> Also, maybe we should try ggml_custom_4d and inject clip_image_encode_ane as a node on the ggml cgraph. If that works, it will make everything look much cleaner. Do you think it's a valid use case for ggml_custom_4d?

Haven't considered such a use case for ggml_custom_4d. Sounds worth exploring.

@tc-mb (Author) commented Aug 13, 2025

@ggerganov @ngxson Yes, I understand that introducing a new feature requires more time to discuss its design, including its name, structure, and interface definition. All of this takes time, and I have plenty of time to prepare for it. I will follow the discussion and make sure this feature is incorporated into llama.cpp in a proper manner.
